1  Introduction

The success of MLFMA in solving large scale problems has naturally led to efforts
in parallelizing the algorithm. Several academic and industrial research groups have
made significant progress in their attempts (see [1, 2] and the references therein).
We have shown that the parallel implementation of MLFMA developed at the University
of Illinois, called ScaleME, has excellent scaling properties [2].

In this paper, we summarize the results of our efforts in developing a scalable
distributed memory fast electromagnetic integral equation solver. The massively
parallel scattering code, called LSSP, is based on a Galerkin discretization of the
combined field integral equation (CFIE) using the RWG basis functions. The resulting
matrix equation is then solved using a parallel GMRES solver [3]. The code uses the
distributed memory parallel MLFMA library called ScaleME for evaluating matrix-vector
products and the Message Passing Interface (MPI) for inter-processor communication.
Furthermore, once the solution is obtained, the bistatic RCS is computed using a
parallelized, MLFMA based, far field evaluation algorithm [1]. The objective of this
paper is to present some recent results demonstrating the scalability of the code for
solving realistic problems.

2  Summary  of  Parallelization  of  MLFMA 

The basic idea in parallelizing MLFMA can be described as follows: for the top
(coarse) few levels, replicate the boxes in every processor, but split the far field
patterns equally among all processors. For the finer levels, divide the boxes at each
level equally among the processors.
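The ownership rule above can be sketched in a few lines of Python. This is only an illustration, not the ScaleME implementation: the function names `split_evenly` and `partition_level` are hypothetical, contiguous chunks are an assumed choice, and the transition between the two kinds of levels is ignored.

```python
def split_evenly(n_items, p):
    """Split range(n_items) into p contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(n_items, p)
    chunks, start = [], 0
    for rank in range(p):
        size = base + (1 if rank < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def partition_level(n_boxes, n_samples, p, shared):
    """Return, for each rank, the (boxes, far-field samples) it owns on one level.

    shared=True : every rank holds all boxes but only a slice of each pattern.
    shared=False: every rank holds a slice of the boxes with complete patterns.
    """
    if shared:
        sample_chunks = split_evenly(n_samples, p)
        return [(range(n_boxes), sample_chunks[rank]) for rank in range(p)]
    box_chunks = split_evenly(n_boxes, p)
    return [(box_chunks[rank], range(n_samples)) for rank in range(p)]
```

On a coarse level with few boxes and many pattern samples, `shared=True` still gives every processor an equal amount of work, whereas splitting the boxes themselves could leave some processors with none.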

The levels that are replicated in every processor are called “shared” levels, and the
levels for which the grain size is retained to be a box are called “distributed” levels.
Clearly, this scheme divides the tree horizontally into different layers, each layer
consisting of one or more levels. These overlapping layers are the shared layer, the
transition layer, and the distributed layer. Each layer has distinct communication
and computational behaviors. For instance, for the distributed layer, communications
are necessary during all three phases, viz. the aggregation, translation, and
disaggregation phases. However, for the shared and transition layers, no communication
is necessary during the translation phase. Furthermore, during the aggregation
and disaggregation phases, these two layers require communication of partial
radiation/receiving patterns. In fact, during the aggregation and the disaggregation
phases, parallel interpolation and anterpolation operations are required.


However, since each processor has only $N/p$ samples of the far field pattern, the
maximum length of the messages is bounded by the same amount. Indeed, using a
local interpolation/anterpolation scheme, the length of the message can be reduced
from $O(N)$ to $O(\sqrt{N})$ [2].


3  Parallel  MLFMA  based  RCS  computations 


The evaluation of RCS after solving the matrix equation is computationally expensive.
To see this, note that evaluating the RCS in a single direction requires $O(N)$
operations. For large scale problems, the number of directions in which the RCS
is sought on one plane cut is $O(\sqrt{N})$, thereby requiring a total evaluation time of
$O(N^{1.5})$. Furthermore, the proportionality constant is rather large owing to the
various geometric computations involved.
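To make the growth rate concrete, the following sketch counts the operations of a direct evaluation. The function name and the `directions_per_sqrt_n` parameter are hypothetical, and the proportionality constants are set to one purely for illustration.

```python
import math


def direct_rcs_cost(n_unknowns, directions_per_sqrt_n=1.0):
    """Operation count of direct RCS evaluation on one plane cut.

    Assumes one unit of work per unknown per direction and
    directions_per_sqrt_n * sqrt(N) directions (illustrative constants only).
    """
    n_directions = directions_per_sqrt_n * math.sqrt(n_unknowns)
    return n_unknowns * n_directions


# Quadrupling the number of unknowns multiplies the cost by 4**1.5 = 8.
growth = direct_rcs_cost(4_000_000) / direct_rcs_cost(1_000_000)
```

Under these assumptions, each doubling of the frequency (which roughly quadruples the number of unknowns for a surface discretization) makes the direct RCS evaluation about eight times more expensive.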


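In this section, we briefly discuss a method by which the bistatic RCS can be
evaluated rapidly using parallel computers. The bistatic RCS can be written as

$$\sigma(\theta, \phi) = 4\pi \left| \sum_{i=1}^{N} a_i \, k \left( \bar{\mathbf{I}} - \hat{\mathbf{s}}\hat{\mathbf{s}} \right) \cdot \int_S \mathbf{f}_i(\mathbf{r}') \, e^{-ik \hat{\mathbf{s}} \cdot \mathbf{r}'} \, dS' \right|^2, \qquad (1)$$

where $\mathbf{f}_i$ is the $i$-th basis function and $a_i$ the corresponding coefficient. Also, we
have set $\hat{\mathbf{s}} = (1, \theta, \phi)$. For a given point $\mathbf{c} \in \mathbb{R}^3$,

$$\mathbf{F}_m(\hat{\mathbf{s}}) = k \left( \bar{\mathbf{I}} - \hat{\mathbf{s}}\hat{\mathbf{s}} \right) \cdot \int_S \mathbf{f}_m(\mathbf{r}') \, e^{-ik \hat{\mathbf{s}} \cdot (\mathbf{r}' - \mathbf{c})} \, dS' \qquad (2)$$

is the radiation pattern of the $m$-th basis function. By setting $\mathbf{c} = 0$, we see that
each summand in Equation (1) is, up to the coefficient $a_i$, the radiation pattern of
the corresponding basis function with respect to the origin.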


Since we have the radiation patterns of the basis functions with respect to the finest
level of the MLFMA tree, we can compute the sum in Equation (1) using the upward
pass of MLFMA. However, the upward pass needs to be supplemented with an
interpolation at the topmost level. This interpolation must be parallelized efficiently.


Depending  on  whether  the  algorithm  uses  shared  levels  or  not,  the  parallelization 
requires  two  different  approaches.  In  the  absence  of  shared  levels,  parallelization 
involves  only  a  global  reduction  operation  after  the  interpolation  to  the  root  box. 
When  shared  levels  are  present,  the  algorithm  has  to  take  into  account  sparsity  in 
the  data  structures  and  thus  is  more  involved.  The  details  of  these  methods  are 
described  in  [1]. 
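For the case without shared levels, the role of the global reduction can be illustrated with a small serial sketch. It is not the algorithm of [1]: scalar point-source patterns stand in for the vector radiation patterns of the basis functions, the upward pass is replaced by a direct sum, and `global_sum` is a serial stand-in for an MPI sum reduction (for example `MPI_Allreduce` with `MPI_SUM`). All function names are hypothetical.

```python
import cmath
import math


def local_pattern(sources, directions, k):
    """Partial root-box pattern from the sources owned by one rank.

    Each source is (a, (x, y)): a complex coefficient and a 2-D position.
    The scalar pattern exp(-i k s.r) is an assumed simplification.
    """
    return [
        sum(a * cmath.exp(-1j * k * (sx * x + sy * y)) for a, (x, y) in sources)
        for (sx, sy) in directions
    ]


def global_sum(partials):
    """Serial stand-in for a global sum reduction over all ranks."""
    return [sum(samples) for samples in zip(*partials)]


def rcs_from_pattern(pattern):
    """Scalar analogue of the RCS: 4*pi*|F|^2 in each direction."""
    return [4.0 * math.pi * abs(f) ** 2 for f in pattern]
```

Because the pattern is linear in the sources, summing the per-rank partial patterns reproduces the pattern of the full source set, so a single reduction after the local work is enough in this setting.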


4  Numerical  Results 

We have verified the correctness and the accuracy of the code by comparing with
analytical solutions for perfectly conducting spheres, as well as by comparing with
other results available from the literature. Here we present two sets of results
demonstrating the scalability of the methods employed. Let $p$ be the number of processors
and $T_1$ and $T_p$ be the time taken by the algorithm on one processor and on $p$
processors, respectively. Then, the parallel efficiency is $\eta = T_1/(pT_p)$.
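The efficiency definition translates directly into code. The sketch below uses a hypothetical function name, and the timings in the usage lines are made-up numbers chosen only to show the two regimes; they are not measurements from this work.

```python
def parallel_efficiency(t1, tp, p):
    """Parallel efficiency eta = T_1 / (p * T_p)."""
    if p <= 0 or tp <= 0:
        raise ValueError("p and tp must be positive")
    return t1 / (p * tp)


# Hypothetical timings (seconds), for illustration only.
eta_sublinear = parallel_efficiency(t1=100.0, tp=30.0, p=4)    # below 1
eta_superlinear = parallel_efficiency(t1=100.0, tp=24.0, p=4)  # above 1
```

A value above one corresponds to the "superlinear" behavior that can appear when the per-processor working set starts to fit in a faster level of the memory hierarchy.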


First, we demonstrate the parallel efficiency of MLFMA for evaluating matrix-vector
products. For this, we consider the scattering from a 15λ perfectly conducting cube
modeled using 294,912 unknowns. The parallel efficiency is plotted in Figure 1. The
figure demonstrates that the shared levels improve the efficiency significantly as the
number of processors is increased. This behavior is consistent with the theoretical
analysis [2]. However, it may also be noted that increasing the number of shared
levels from 2 to 3 does not improve the performance very much. In fact, we have
observed that, depending on the geometry, there is a range of shared levels for which
the performance is more or less the same. Finally, the “superlinear” performance
exhibited for a small number of processors is a result of the machine-dependent memory
hierarchy.


Figure 1: Parallel efficiency for the case CUBE-15λ (cube side L = 15λ, N = 294,912).
SLEV refers to the finest shared level used in the simulation.


Next, we demonstrate the scalability of the code with respect to the number of
unknowns using a full-size, fictitious aircraft, referred to as VFY-218. We study
the model with 625,626 unknowns at 2 GHz and 2,464,536 unknowns at 4 GHz.
The total run time was about 2 hours and 13 minutes for the first case and about
5 hours for the second. The times for evaluating the matrix-vector product and for
evaluating the RCS in 1800 directions are given in Table 1. The results demonstrate
very good scaling.


Num. Proc.    MatVec Time (s)        RCS Eval. Time (s)
              2 GHz      4 GHz       2 GHz      4 GHz
16            59.55      -           10.71      -
32            30.87      111.16      6.40       20.02
64            -          64.41       -          11.35

Table 1: Demonstration of scalability with respect to problem size for the full scale
model of an aircraft, VFY-218. At 2 GHz, the number of unknowns N = 625,626
and at 4 GHz, N = 2,464,536.
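The entries of Table 1 can be checked with a short calculation. The sketch below reads the table as matrix-vector times of 59.55 s (16 processors, 2 GHz), 30.87 s and 111.16 s (32 processors, 2 GHz and 4 GHz), and 64.41 s (64 processors, 4 GHz). The relative efficiency with respect to a smaller processor count and the comparison against the standard O(N log N) cost of an MLFMA matrix-vector product are illustrative additions, and the names are hypothetical.

```python
import math

N_2GHZ, N_4GHZ = 625_626, 2_464_536

# Matrix-vector product times in seconds, keyed by (processors, frequency).
MATVEC = {
    (16, "2GHz"): 59.55,
    (32, "2GHz"): 30.87,
    (32, "4GHz"): 111.16,
    (64, "4GHz"): 64.41,
}


def relative_efficiency(p_small, t_small, p_large, t_large):
    """Efficiency of the larger run measured against the smaller one."""
    return (p_small * t_small) / (p_large * t_large)


# Processor scaling at fixed problem size.
eff_2ghz = relative_efficiency(16, MATVEC[(16, "2GHz")], 32, MATVEC[(32, "2GHz")])
eff_4ghz = relative_efficiency(32, MATVEC[(32, "4GHz")], 64, MATVEC[(64, "4GHz")])

# Problem-size scaling at fixed processor count (32 processors).
time_ratio = MATVEC[(32, "4GHz")] / MATVEC[(32, "2GHz")]
n_ratio = N_4GHZ / N_2GHZ
nlogn_ratio = n_ratio * math.log(N_4GHZ) / math.log(N_2GHZ)
```

With these numbers, doubling the processor count retains roughly 96% efficiency at 2 GHz and roughly 86% at 4 GHz, and the measured time ratio between the two problem sizes (about 3.6) stays below the ratio of the unknown counts (about 3.9).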


Finally, we present the bistatic RCS of the aircraft model at 8 GHz. The tour-de-force
simulation involved 10,186,446 unknowns, and we used a 10-level MLFMA
and 126 processors of an SGI Origin 2000 supercomputer. The total solution time
was 7 hours and 25 minutes, and each matrix-vector product evaluation required 119
seconds. The latter result again shows the scaling with respect to the problem size.
The bistatic RCS for the vertical polarization with an incident direction of (90, 90)
is shown in Figure 2.



Figure  2:  Bistatic  RCS  of  VFY-218  at  8  GHz.  N  =  10,186,446. 


5  Conclusions 

The objective of the paper was to present a brief summary of the scalable parallel
code we have developed for electromagnetic scattering computations. We have
presented representative results demonstrating the excellent scalability obtained. More
detailed results will be presented at the conference.